POS-tagging of Tunisian Dialect Using Standard Arabic Resources and Tools
نویسندگان
چکیده
Developing natural language processing tools usually requires a large number of resources (lexica, annotated corpora, etc.), which often do not exist for less-resourced languages. One way to overcome the problem of lack of resources is to devote substantial efforts to build new ones from scratch. Another approach is to exploit existing resources of closely related languages. In this paper, we focus on developing a part-of-speech tagger for the Tunisian Arabic dialect (TUN), a lowresource language, by exploiting its closeness to Modern Standard Arabic (MSA), which has many state-of-the-art resources and tools. Our system achieved an accuracy of 89% (∼20% absolute improvement over an MSA tagger baseline).
منابع مشابه
The Effects of Factorizing Root and Pattern Mapping in Translating between Tunisian Arabic and Standard Arabic
The development of natural language processing tools for dialects faces the severe problem of lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools. Whereas other variants, Arabic dialects, are resource poor. Taking advantage of the closeness of MSA and its dialects, one w...
متن کاملAdapting Standard Open-Source Resources To Tagging A Morphologically Rich Language: A Case Study With Arabic
In this paper we investigate the possibility of creating a PoS tagger for Modern Standard Arabic by integrating open-source tools. In particular a morphological analyser, used in the disambiguation process with a PoS tagger trained on classical Arabic. The investigation shows the scarcity of open-source tools and resources, which complicated the integration process. Among the problems are diffe...
متن کاملCollaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media
Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD) or daily language differs from MSA especially in social media communication. However, most Arabic social media texts have mixed forms and many variations especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for the translation of texts of soc...
متن کاملPOS Tagging of Dialectal Arabic: A Minimally Supervised Approach
Natural language processing technology for the dialects of Arabic is still in its infancy, due to the problem of obtaining large amounts of text data for spoken Arabic. In this paper we describe the development of a part-of-speech (POS) tagger for Egyptian Colloquial Arabic. We adopt a minimally supervised approach that only requires raw text data from several varieties of Arabic and a morpholo...
متن کاملMorphological Segmentation and Part of Speech Tagging for Religious Arabic
We annotate a small corpus of religious Arabic with morphological segmentation boundaries and fine-grained segment-based part of speech tags. Experiments on both segmentation and POS tagging show that the religious corpus-trained segmenter and POS tagger outperform the Arabic Treebak-trained ones although the latter is 21 times as big , which shows the need for building religious Arabic linguis...
متن کامل